a post-clustering differential expression method
July 1, 2024
RNA carries the genetic information specific in DNA. There are two main types of RNA.
Non-coding RNA performs some biological function.
Messenger RNA forms a template for protein production (it codes for a protein which performs some biological function).
TODO: give an explanation of scRNA-seq data collection and analysis.
Biologists are interested in determining the cell types of the cells in a sample.
A natural approach:
However, we used the data twice.
First, we clustered the cells. The algorithm chose these two specific clusters because, in some sense, these two subsets of cells look the most different in terms of their gene expression levels.
Next, we test each gene to see how different the expression levels are between the two clusters. However, we already know that this variation is there: that’s how we got the clusters in the first place. So we are testing for variation that we already found through clustering.
In other words: we forced the genes to exhibit different distributions in the two clusters, so we are likely to find this forced variation instead of true variation in the dataset.
In words, the ClusterDE method can be broken up into four steps:
Generate a synthetic null dataset that mimics the structure (in particular, the gene-gene correlation structure) of the original data.
Separately partition the synthetic null data and the target data (real data) into two clusters.
Separately for the null and target data, perform hypothesis tests for differentially expressed genes between the two clusters. For each gene, compute some sort of difference between the p-values on the two datasets.
Output a subset of the significant results from step 3 as potential cell-type marker genes.
A graphical illustration of ClusterDE.
The synthetic null generation consists of three steps, as described in the following figure.
Null generation steps.
Theorem (Sklar’s Theorem): Let \(\mathbf{X}\) be a \(m\)-dimensional random vector with joint cumulative distribution function \(F\) and marginal distribution functions \(F_j, j = 1, ..., m\). The joint CDF can be expressed as
\[ F(x_1, ..., x_m) = C(F_1(x_1), ..., F_m(x_m)) \]
with associated probability density (or mass) function
\[ f(x_1, ..., x_m) = c(F_1(x_1), ..., F_m(x_m)) f_1(x_1) ... f_m(x_m) \]
for a \(d\)-dimensional copula \(C\) with copula density \(c\).
The inverse also holds: the copula corresponding to a multivariate CDF \(F\) with marginal distribution functions \(F_j, j = 1, ..., m\) can be expressed as
\[ C(u_1, ..., u_m) = F(F_1^{-1}(u_1), ..., F_m^{-1}(u_m)) \] , and the copula density (or mass) function is
\[ c(u_1,...,u_m) = \frac{f(F_1^{-1}(u_1), ..., F_m^{-1}(u_m))} {f_1(F_1^{-1}(u_1)) ... f_m(F_m^{-1}(u_m))} \] .
Sklar’s Theorem allows us to choose \(C\), the specific copula function that we will use to approximate \(F\). For convenience, we will choose \(C\) to be a multivariate Gaussian distribution, since we know how to sample from it. Therefore, we have found a way to estimate the joint MVNB distribution of genes.
\[ C(\mathbf{u}; \mathbf{R}) = \Phi_{\mathbf{R}}(\Phi^{-1}(u_1), ..., \Phi^{-1}(u_m)) \]
Now, our goal is to estimate the parameters \({\mu_j, \sigma_j}_{j = 1}^m\) and \(\mathbf{R}\).
Recall that the power of the copula is that it allows us to consider the marginal distributions separately from the covariance structure. Therefore, we can proceed as follows.
For each gene \(j\), estimate \(\{\mu_j, \sigma_j\}\) using maximum likelihood. These are the marginal distributions.
Separately, use the Gaussian copula to model the dependence structure.
\[ \begin{bmatrix} \tilde{Z}_{11} & \dots & \tilde{Z}_{1m} \\ \vdots & \ddots & \\ \tilde{Z}_{n1} & & \tilde{Z}_{nm} \end{bmatrix} \]
\[ \begin{bmatrix} \hat{F}_1^{-1}(\Phi(\tilde{Z}_{11})) & \dots & \hat{F}_m^{-1}(\Phi(\tilde{Z}_{1m})) \\ \vdots & \ddots & \\ \hat{F}_1^{-1}(\Phi(\tilde{Z}_{n1})) & & \hat{F}_m^{-1}(\Phi(\tilde{Z}_{nm})) \end{bmatrix} \]
ClusterDE allows any clustering algorithm. Note that it only handles the case of two clusters, so if you started out with more clusters, you should identify a particular pair of interest. In the Practical guidelines for ClusterDE usage subsection, steps 1 and 2 describe how an analyst should proceed.
Given \(\geq 2\) clusters, identify 2 clusters of interest. Generally, this will be a pair for which you suspect the clustering is spurious (i.e. you think the two clusters actually come from the same cell type, so they are strong candidates to be merged into a single cluster).
Filter the data so that you only consider the subset of cells that come from those two clusters.
UMAP is common in scRNA-seq analysis.
TODO: summarize UMAP.
The example analyses in the presentation use the default Seurat clustering procedure, which uses the Louvain algorithm.
TODO: describe the Seurat clustering pipeline.
TODO: summarize the Louvain algorithm.
ClusterDE allows any DE test.
TODO: choose and summarize a common DE test(s), e.g. Wilcoxon.
Let \(P_1, ..., P_m\) be the p-values computed by the \(m\) DE tests on the target data. Define the target DE score \(S_j := -\log_{10}P_j\). Likewise for the synthetic null data.
TODO: motivate the formula for the DE score. It is essentially the information content of the p-value.
The final outputs of step 3: \(m\) target DE scores \(S_1, ..., S_m\); \(m\) null DE scores \(\tilde{S}_1, ..., \tilde{S}_m\).
Given the target and null DE scores, compute a contrast score for gene \(j\) as \(C_j := S_j - \tilde{S}_j\).
FDR control.
We choose the minimum \(t\) to maximize our discoveries, i.e. to “use” all of the false discoveries that we can.